Abstract

Background: Two animals from the same litter are often more alike than two animals from separate
litters. This litter-to-litter variation (i.e. litter effects) can be either naturally occurring or induced by applying 
an experimental treatment to whole litters rather than to the individual offspring. An example of the latter is 
the valproic acid (VPA) model of autism, where the disease phenotype in offspring is caused by giving VPA to 
pregnant females. In this case, the sample size is the number of pregnant females and not the number of offspring 
derived from them. If such experiments are not appropriately designed and analysed, the results can be severely
biased as well as extremely underpowered. 

Results: A review of the VPA literature showed that only 9% (3/34) of studies correctly determined the sample 
size. In addition, litter effects accounted for up to 61% (p < 0.001) of the variation in behavioural outcomes,
which was much larger than the treatment effects. Few studies reported using randomisation (12%)
or blinding (18%), and none indicated that a sample size calculation or power analysis had been conducted.

Conclusions: Litter effects are common and large, and ignoring them can make replication of findings difficult and
can contribute to the low rate of translating preclinical in vivo studies into successful therapies. Only a minority
of studies reported using rigorous experimental methods, which is consistent with much of the preclinical in vivo 
literature. 
Background

Numerous animal models (lesion, transgenic, knock- 
out, selective breeding, etc.) have been developed 
for a variety of psychiatric, neurodegenerative, and 
neurodevelopmental disorders. While many of these 
models have been helpful for understanding disease 
pathology, they have been less useful for discovering
potential therapies, or for predicting which
treatments will be useful in the clinic. Translation
from in vivo animal models (typically rodent) has 
been poor, despite many years of research and ef- 
fort. There are many reasons for this, including the 
inherent difference in biology between rodents and 
humans [1], particularly relating to higher cognitive
functions. In addition, there is the ever-present
question of whether a particular animal model is
even suitable; whether it captures the disease process
of interest or faithfully mimics key aspects of
the human condition. While important, these two 
considerations will be put aside and the focus will 
be on the design and analysis of a key aspect of preclinical
studies using multiparous species, and the
effect that this has on the reproducibility of results.
There are two issues that will be discussed. The 
first deals with designs where an experimental treat- 
ment is applied to whole litters rather than to the 
individual animals, usually because the treatment 
is applied to pregnant females and therefore to all 
of the offspring. The second is the natural litter- 
to-litter variation (i.e. litter effect) that is often 
present, which means that the value of a measured 
experimental outcome is influenced by the litter that 
the animal came from. 



Applying treatments to whole litters 

Some disease models have a distinctive experimental 
design feature: the treatment is applied to pregnant 
females (and therefore to all of the unborn animals 
within that female), but the scientific interest is in 
the individual offspring (Figure 1). Here, the "treat- 
ment" refers to the experimental manipulation that 
induces the disease features, and it does not refer 
to a therapeutic treatment. This design is com- 
mon in toxicology and nutrition studies, but also 
used in neuroscience studies when examining the 
effects of maternal stress and in the valproic acid 
(VPA) model of autism. Difficulties arise because 
the experimental unit ("n"; defined as the smallest
physical unit that can be randomly assigned to
a treatment condition) is the pregnant dam and
not the individual offspring [2-11]. In other words,
the sample size is the number of dams, and the
offspring are considered subsamples, much like the
left and right kidney from a single animal do not
represent a sample size of two (n = 1, but there
are two replicate measurements). This may come 
as a surprise, and it is irrelevant that the scien- 
tific interest is in the offspring, or that the offspring 
eventually become individual entities (unlike kidneys).
Regulatory authorities have clear guidelines
on the matter; for example, the Organisation
for Economic Co-operation and Development
(OECD) has made a firm statement in their guidelines
for chemical testing: "Developmental studies
using multiparous species where multiple pups per
litter are tested should include the litter in the sta- 
tistical model to guard against an inflated Type I 
error rates. The statistical unit of measure should 
be the litter and not the pup. Experiments should 
be designed such that littermates are not treated 
as independent observations [p. 12]" [13]. There is 
a restriction on randomisation because only whole 
litters can be assigned to the treatment or control 
conditions, which has implications for how studies 
are designed and analysed. An appropriate analy- 
sis can be conducted by using only one animal per 
litter (randomly selected), which allows standard 
methods to be used (e.g. t-test, ANOVA, etc.).
This is often not the most efficient design in terms 
of animal usage, unless the excess animals can be 
used for other experiments. A second option is to 
use more than one animal per litter, and then aver- 
age the values of the animals within a litter. These 
mean values can then be taken forward and analysed 
using standard methods. A third option is to use 
multiple animals per litter, and then analysis is per- 
formed with a nested or hierarchical model, which 
properly handles the structure of the data (i.e. ani- 
mals are nested within litters) and avoids artificially 
inflating the sample size (also known as pseudoreplication).
The third method is preferred over
averaging values within a litter because litter is entered
as a variable in the analysis and the magnitude
of the litter effect can be quantified. In addition, in- 
formation on the precision will be lost by averaging, 
but is retained and made use of in the hierarchical 
model. When using the first two options, it is clear 
that to increase the sample size and thus power, the 
number of litters needs to be increased. This is also 
true for the third option, but may not be so readily 
apparent [9, pp. 3-4], and is discussed further below.
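
As a minimal sketch of the second option (averaging within litters and then applying a standard test), the Python below collapses each litter to its mean and computes a pooled two-sample t statistic on the litter means. The outcome values and litter labels are hypothetical and serve only to show that the degrees of freedom follow the number of litters, not the number of animals.

```python
from math import sqrt
from statistics import mean, variance


def litter_means(values_by_litter):
    """Collapse each litter to a single value: the mean of its animals."""
    return [mean(values) for values in values_by_litter.values()]


def pooled_t(group_a, group_b):
    """Student's two-sample t statistic and its degrees of freedom."""
    na, nb = len(group_a), len(group_b)
    df = na + nb - 2
    pooled_var = ((na - 1) * variance(group_a) + (nb - 1) * variance(group_b)) / df
    se = sqrt(pooled_var * (1 / na + 1 / nb))
    return (mean(group_a) - mean(group_b)) / se, df


# Hypothetical outcome values: three litters per group, three animals each.
control = {"A": [4.1, 4.4, 3.9], "B": [5.2, 5.0, 5.5], "C": [4.6, 4.8, 4.3]}
treated = {"D": [5.9, 6.1, 5.6], "E": [4.9, 5.1, 5.3], "F": [6.4, 6.0, 6.2]}

t_stat, df = pooled_t(litter_means(treated), litter_means(control))
# Six litters give 4 degrees of freedom, even though 18 animals were measured.
print(f"t = {t_stat:.2f} on {df} degrees of freedom")
```

Adding more animals to a litter would make each litter mean more precise, but the degrees of freedom would stay at four until more litters are added.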



A related design issue is that greater statistical 
power can be achieved when littermates are used to 
test a therapeutic compound versus a placebo. If 
the therapeutic treatment is applied to the individ- 
ual animals postnatally, then the individual animal 
is the experimental unit for this comparison. This is 
referred to as a split-plot design and has more than 
one type of experimental unit: litters for some com- 
parisons and individual animals for others. These 
studies therefore require careful planning and anal- 
ysis, but biologists are rarely introduced to these 
designs and how to appropriately analyse them
during the course of their training.



Litter effects are ubiquitous, large, and important 

It is known that on many variables and across many 
species, monozygotic twins are more similar than 
dizygotic twins, which are more similar than non- 
twin siblings, and which in turn are more similar 
than two unrelated individuals. What has not been 
fully appreciated is that all of the standard statisti- 
cal methods (e.g. t-test, ANOVA, regression, non- 
parametric methods) assume that the data come 
from unrelated individuals. However, rodents from 
the same litter are effectively dizygotic twins; they
are genetically very similar and share prenatal and
early postnatal environments. Therefore, studies
need to be designed and analysed in such a way that
differences between litters do not bias or confound
the results [2-11]. More specifically, this relates
to the assumption of independence of observations. 
For example, measuring blood pressure (BP) from 
the left and right arm of ten unrelated people only 
provides ten independent measurements of BP, not 
twenty. This is because the left and right BP values 
will be highly correlated — if the BP value measured 
from a person's left arm is high, then so will the 
value measured from their right arm. Similarly, two 
animals from the same litter will tend to have val- 
ues that are more alike (i.e. correlated) than two 
animals from two different litters. This lack of in- 
dependence needs to be handled appropriately in 
the analysis and the three strategies outlined in the 
previous section can be used. Many animal models 
are derived from highly inbred strains, and this re- 
sults in reduced genotypic and phenotypic variation. 
This is a different issue and unrelated to lack of 
independence. It does not mean that animals "are 
all the same" and that differences between litters do 
not exist. 



Litter effects are not a minor issue that only 
statistical pedants worry about, with little practical
importance for scientists. Using actual body
weight data from their experiment, Holson and
Pearce showed twenty years ago that if three treated
and three control litters are used, with two offspring
per litter (total number of offspring = 12), then
the false positive rate (Type I error) is 20% rather
than 5% [3]. Furthermore, the false positive rate
increases with the number of offspring per litter: if 
the number of offspring per litter is 12 (total num- 
ber of offspring = 72) then the false positive rate 
is 80%. The error rate is also influenced by the 
relative variability between and within litters and 
will therefore vary for each experimental outcome. 
Nevertheless, given that papers report the results 
of multiple tests (multiple outcome variables and 
multiple comparisons), we can expect the literature 
to be rife with false positive results. It may seem 
paradoxical, but in addition to too many false posi- 
tives, ignoring litter-to-litter variation can also lead 
to low power (too many false negatives) when true 
effects exist [3][4]. This is because differences between
litters end up as unexplained variation, and
thus the "noise" in the data is increased, poten- 
tially masking true treatment effects. A subsequent 
study in 1997 using forty litters found "significant 
litter effects. . . in varying degrees, for almost ev- 
ery behavioural, morphologic, and neuroendocrine 
measure; they were evident across indices of neural, 
adrenal, thyroid, and immunologic functioning in 
adulthood" [I] (and see references therein for fur- 
ther studies supporting this conclusion). Holson 
and Pearce reported that only 30% of papers in the 
behavioural neurotoxicology literature correctly ac- 
counted for litter effects [3] and Zorrilla noted that 
34% of papers in Developmental Psychobiology cor- 
rectly accounted for litter effects and only 15% of 
papers in related journals [I]. This issue has been 
brought up repeatedly for almost forty years |2 , 
but has largely been ignored by experimental biol- 
ogists. One can only speculate on the number of 
erroneous conclusions that have been reached, and 
the resources that have been wasted. 
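
The inflation of the false positive rate can be illustrated with a small simulation. The sketch below is not Holson and Pearce's analysis: it assumes normally distributed data with no true treatment effect, three litters per group, and illustrative between-litter and within-litter standard deviations (0.88 and 0.81), so the rates it prints need not match the 20% and 80% quoted above. The function names and the table of critical values are introduced here for illustration only.

```python
import random
from math import sqrt

# Two-sided 5% critical values of Student's t for the degrees of freedom used
# below (4 for litter means; 10 and 70 for the naive test with 2 or 12 pups).
T_CRIT = {4: 2.776, 10: 2.228, 70: 1.994}


def mean(xs):
    return sum(xs) / len(xs)


def var(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)


def pooled_t(a, b):
    """Student's two-sample t statistic and its degrees of freedom."""
    df = len(a) + len(b) - 2
    pooled = ((len(a) - 1) * var(a) + (len(b) - 1) * var(b)) / df
    return (mean(a) - mean(b)) / sqrt(pooled * (1 / len(a) + 1 / len(b))), df


def simulate_group(n_litters, n_pups, sd_between, sd_within, rng):
    """One treatment group: each litter shares a random litter effect."""
    group = []
    for _ in range(n_litters):
        litter_effect = rng.gauss(0.0, sd_between)
        group.append([litter_effect + rng.gauss(0.0, sd_within) for _ in range(n_pups)])
    return group


def false_positive_rates(n_pups, n_sims=2000, n_litters=3,
                         sd_between=0.88, sd_within=0.81, seed=1):
    """Return (naive rate, litter-mean rate) when no treatment effect exists.

    Only n_litters=3 with n_pups of 2 or 12 is covered by T_CRIT.
    """
    rng = random.Random(seed)
    naive_hits = means_hits = 0
    for _ in range(n_sims):
        g1 = simulate_group(n_litters, n_pups, sd_between, sd_within, rng)
        g2 = simulate_group(n_litters, n_pups, sd_between, sd_within, rng)
        # Incorrect analysis: every animal is treated as an independent observation.
        t, df = pooled_t([y for lit in g1 for y in lit], [y for lit in g2 for y in lit])
        naive_hits += abs(t) > T_CRIT[df]
        # Appropriate analysis: one value (the mean) per litter.
        t, df = pooled_t([mean(lit) for lit in g1], [mean(lit) for lit in g2])
        means_hits += abs(t) > T_CRIT[df]
    return naive_hits / n_sims, means_hits / n_sims


for pups in (2, 12):
    naive, by_litter = false_positive_rates(pups)
    print(f"{pups} pups/litter: naive = {naive:.3f}, litter means = {by_litter:.3f}")
```

Under these assumptions the litter-mean analysis should stay close to the nominal 5% rate, while the naive analysis drifts further from it as more animals per litter are added.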



One might argue that when many studies are 
conducted, including replications within and be- 
tween labs, the evidence will eventually converge to 
the "truth", and therefore these considerations are 
only of minor interest. Unfortunately, there is no 
guarantee of such convergence, as the literature on 
the superoxide dismutase (SOD1) transgenic mouse 
model of amyotrophic lateral sclerosis (ALS) demon- 
strates. Several treatments showed efficacy in this 
model and were advanced to clinical trials, where 
they proved to be ineffective [15]. A subsequent
large-scale and properly executed replication study
did not support the previous findings [16]. This
study also identified litter as an important variable
which affected survival (the main outcome),
and which was not taken into account in the earlier 
studies. The authors also demonstrated how false 
positive results can arise with inappropriate experi- 
mental designs and analyses. Litter effects were not 
the only contributing factor; a meta-analysis of the 
preclinical SOD1 literature revealed that only 31% 
of studies reported randomly assigning animals to 
treatment conditions, and even fewer reported blind 
assessment of outcomes [17]. Lack of randomisation
and blinding is known to inflate the size of treatment
effects [18-22]. In addition, there was evidence
of publication bias, where studies with positive results
were more likely to be published [17]. Thus,
the combination of poor experimental design, analysis,
and publication bias contributed to numerous
incorrect decisions regarding treatment efficacy. 



Methods 
Literature review 

Primary research articles that injected pregnant 
dams with VPA and subsequently analysed the ef- 
fects in the offspring were identified on PubMed 
using the search term "(VPA OR 'valproic acid') 
AND autism" (up to the end of 2011). Reference 
lists from these articles were then examined for fur- 
ther relevant studies. A total of thirty-five studies 
were found, and one was excluded as key information
was located in the supplementary material, but
this was not available online [39]. Two key pieces of
information were extracted: (1) whether the analysis
correctly identified the experimental unit as
the litter, and (2) whether important features of
good experimental design were mentioned, including
randomisation, blinding, sample size calculation,
and whether the total sample size (i.e. number of 
pregnant dams) was indicated or could be deter- 
mined. 



General quality of preclinical animal studies 

Previous studies have shown that the general quality of
the design, analysis, and interpretation of preclinical
animal experiments is low [19][20][22-29]. For
example, Nieuwenhuis et al. recently reported that
50% of papers in the neuroscience literature misinterpret
interaction effects [30]. In addition, the
issue of "inflated n", or pseudoreplication, shows up
in other guises [11][31], and whole fields can misattribute
cause-and-effect relationships [32][33]. There
is also the concept of "researcher degrees of freedom",
which refers to the flexibility that scientists
have in choosing the main outcome variables, statistical
models, data transformations, how outliers
are handled, when to stop collecting data, and what
is reported in the final paper [34]. Various permutations
of the above options greatly increase the
chances that at least something will be statistically
significant, and this is what tends to get reported
as the sole analysis that was conducted. Given the
above concerns, it is not surprising that the pharmaceutical
industry has difficulty reproducing many
published results [35-38].



Estimating the importance of litter-to-litter variation

Data from Mehta et al. [40] were used to estimate 
the magnitude of differences between litters on a 
number of outcome variables. This study was cho- 
sen because it included animals from fourteen litters 
(five saline, nine VPA) and therefore it was pos- 
sible to get a good estimate of the litter-to-litter 
variation. In addition, the study mentioned using 
randomisation and blind assessment of outcomes. 
Half of the animals in each condition were also
given MPEP (2-methyl-6-(phenylethynyl)pyridine),
a metabotropic glutamate receptor 5 antagonist. To
assess the magnitude of the litter effects, the effects
of VPA, MPEP, and sex (if relevant) were removed,
and the remaining variability in the data that could
be attributed to differences between litters was estimated.
More specifically, models with and without
a random effect of litter were compared with a likelihood
ratio test. This analysis tests whether the
variance between litters is zero, and it is known that
p-values will be too large because of "testing on the
boundary", and so the simple method of dividing
the resulting p-values by two was used as recommended
by Zuur et al. [41]. The exact specification
of the models is provided as R code in Additional
File 1 and the data are provided in Additional File 2.
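
The boundary correction described above can be sketched in a few lines of Python. This is only an illustration, not the R code from Additional File 1: it assumes that the two models differ by a single variance parameter (the random effect of litter) and that their log-likelihoods are already available. The function name and the log-likelihood values are hypothetical.

```python
from math import erfc, sqrt


def boundary_lrt_pvalue(loglik_without_litter, loglik_with_litter):
    """Likelihood ratio test for one variance component, with the p-value halved.

    Returns the test statistic and the halved p-value.
    """
    stat = max(0.0, 2.0 * (loglik_with_litter - loglik_without_litter))
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_chi2 = erfc(sqrt(stat / 2.0))
    # Halve the p-value because the null value (variance = 0) lies on the boundary.
    return stat, p_chi2 / 2.0


# Hypothetical log-likelihoods from models fitted without and with litter.
stat, p = boundary_lrt_pvalue(-112.4, -105.1)
print(f"LRT statistic = {stat:.1f}, halved p-value = {p:.5f}")
```

When the two models fit equally well the statistic is zero and the halved p-value is 0.5, which is the largest value this procedure can return.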



Power analysis 

In these types of designs, power (the ability to detect 
an effect that is actually present) is influenced by (1) 
the number of litters, (2) variability between litters, 
(3) number of animals within litters, (4) variabil- 
ity of animals within litters, (5) difference between 
the means of the treatment groups (effect size), (6) 
significance cutoff (traditionally α = 0.05), and (7)
the statistical test used. In order to illustrate the 
importance of the number of litters relative to the 
number of animals within litters, a power analysis 
was conducted with the number of litters per group 
varying from three to ten, and the number of ani- 
mals per litter varying from one to ten. The other 
factors were held constant. Variability between litters
(SD = 0.8803) and the variability of animals
within litters (SD = 0.8142) were estimated from the
locomotor activity data from Mehta et al. [40]. For
each combination of litters and animals, 5000 simulated
datasets were created with a mean difference
between groups of 0.15. Once the datasets were generated,
the power for three types of analyses was
calculated. The first analysis averaged the values
of the animals within each litter, and then groups
were compared with a t-test. The second analysis
used a mixed-effects model, and the third ignored
litter and just compared all of the values between groups
with a t-test. The last analysis is incorrect and is only
presented to demonstrate how artificially inflating
sample size affects power. The power for each analysis
was determined as the proportion of tests that
had p < 0.05. The R code is provided in Additional
File 1 and is adapted from Gelman and Hill [42].
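
The first of these analyses (averaging within litters, then a t-test) can be sketched as a small simulation in Python. This is not the R code from Additional File 1. It uses the two standard deviations quoted above, but the effect size passed in the example (1.0) is an illustrative choice rather than the value used in the paper, fewer simulated datasets are generated, and the function names and the table of critical values are introduced here for illustration only.

```python
import random
from math import sqrt

# Two-sided 5% critical values of Student's t, keyed by degrees of freedom
# (2 * litters per group - 2, for three to ten litters per group).
T_CRIT = {4: 2.776, 6: 2.447, 8: 2.306, 10: 2.228,
          12: 2.179, 14: 2.145, 16: 2.120, 18: 2.101}


def mean(xs):
    return sum(xs) / len(xs)


def var(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)


def litter_mean(shift, n_animals, sd_between, sd_within, rng):
    """Simulate one litter and return the mean of its animals."""
    litter_effect = rng.gauss(0.0, sd_between)
    return mean([shift + litter_effect + rng.gauss(0.0, sd_within)
                 for _ in range(n_animals)])


def power_litter_means(n_litters, n_animals, effect, n_sims=2000,
                       sd_between=0.8803, sd_within=0.8142, seed=1):
    """Proportion of simulated experiments with p < 0.05 for a t-test on litter means."""
    rng = random.Random(seed)
    df = 2 * n_litters - 2
    hits = 0
    for _ in range(n_sims):
        ctrl = [litter_mean(0.0, n_animals, sd_between, sd_within, rng)
                for _ in range(n_litters)]
        trt = [litter_mean(effect, n_animals, sd_between, sd_within, rng)
               for _ in range(n_litters)]
        pooled = (var(ctrl) + var(trt)) / 2.0  # equal group sizes
        t = (mean(trt) - mean(ctrl)) / sqrt(pooled * 2.0 / n_litters)
        hits += abs(t) > T_CRIT[df]
    return hits / n_sims


# Rows: litters per group; columns: 1, 3, or 10 animals per litter.
for n_litters in (3, 6, 10):
    row = [power_litter_means(n_litters, n_animals, effect=1.0)
           for n_animals in (1, 3, 10)]
    print(n_litters, ["%.2f" % p for p in row])
```

Setting `effect=0.0` turns the same function into a check of the false positive rate, which should then sit near 0.05.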



Results and Discussion 

Low quality of the published literature 

The VPA model of autism is relatively new and po- 
tential therapeutic compounds tested in this model 
have not yet advanced to human trials. The op- 
portunity therefore exists to clean up the literature 
and prevent a repeat of the SOD1 story. The main 
finding is that only 9% (3/34) of studies correctly
identified the experimental unit and thus made valid
inferences from the data. One study used a nested
design [43], the second mentioned that litter was
the experimental unit [44], and the third used one
animal from each litter, thus bypassing the issues
discussed [45]. For fourteen studies (41%) it was not
possible to determine how many dams were actually 
used (i.e. the sample size), and in four studies (12%) 
the number of offspring used was not indicated. In
addition, only four (12%) reported randomly assign- 
ing pregnant females to the VPA or control group. 
Many studies also used only a subset of the offspring 
from each litter, but often it was not mentioned how 
the offspring were selected. Only six studies (18%) 
reported that the investigator was blind to the ex- 
perimental condition when collecting the data. Ten 
studies (29%) did not indicate whether both male 
and female offspring were used. No study mentioned 
performing a power analysis to determine a suitable 
sample size to detect effects of a given magnitude — 
but this is probably fortuitous, given that only three 
studies correctly identified the experimental unit. It 
is possible that many studies did actually randomise 
and assess outcomes blind but simply did not report 
it. However, randomisation and blinding are cru- 
cial aspects for the validity of the results and their 
omission in manuscripts suggests that they were not 
used. This is further supported by studies showing 
that when manuscripts do not mention using randomisation
or blinding, the estimated effect sizes
are larger compared to studies that do mention using
these methods, which is indicative of bias.

A number of papers had additional statistical or
experimental design issues, ranging from trivial (e.g. 
reporting total degrees of freedom rather than resid- 
ual degrees of freedom for an F-statistic) to serious. 
These include treating individual neurons as the ex- 
perimental unit, which is distressingly common in 
electrophysiological studies but just as inappropri- 
ate as treating blood pressure values taken from left 
and right arms as n = 2, or chopping a single liver 
sample into ten pieces and treating the expression of 
a gene measured in each piece as n = 10 [11]. If only
it were so easy; clinical trials could be conducted 
with tens of patients rather than hundreds or thou- 
sands. Regulatory authorities are not fooled by such 
stratagems, but it seems many journal editors and
peer-reviewers are.



In addition, the wording of two studies suggests 
that control dams did not receive a vehicle injec- 
tion, and thus any differences between groups may 
be partly due to the stress of handling and injection.
For some studies, the reported degrees
of freedom did not correspond to what would be expected
based on the verbal description of the analysis,
and a number of studies did not correctly distinguish
between "within-subjects" and "between-subjects"
effects. A list of studies can be found in
Additional File 3. 



Estimating the magnitude of litter effects

To estimate the extent to which litter effects are
important and how they can affect the results, data
originally published by Mehta et al. [40] were used,
and experimental details can be found therein. Locomotor
activity in the open field is shown in Figure
2 for nine VPA and five saline-injected control litters.
Half of the animals from each condition were
given MPEP (an mGluR5 receptor antagonist) or
saline. There do not appear to be differences between
the VPA and control groups, and there is a slight increase
in activity due to MPEP. This effect of MPEP was
not significant when litter effects were ignored (Figure
2A; p = 0.082), but it was when adjusting for
litter (Figure 2B; p = 0.011). In this case the shift in
p-value was not large, but it happened to decrease
it below the 0.05 threshold after the excess noise
caused by litter-to-litter variation was removed.

It may be difficult to determine whether litter
effects are present by simply plotting the data by
litter because the experimental factors, especially if
they are large, may obscure the effects. It is therefore
better to remove the effect of the experimental
factors first, and then plot the residual values versus
litter. The y-axis for Figure 3 plots the residuals,
which are the differences between the observed locomotor
activity for each animal and the value predicted
from a model containing group (VPA/saline) and
condition (MPEP/saline) as factors (from Figure
2A). The residuals should be pure noise, centred at
zero, and should not be associated with any other
variable. However, it is clear that there are large
differences between litters (Figure 3A), indicating
heterogeneity in the response from one litter to the
next. When litter effects are taken into account,
the mean of each litter is closer to zero. Also note
that the variance of the residuals (σ²) is reduced by 61%
when litter is taken into account (p < 0.001). This
is shown by the spread of the grey points around
zero on the right side of each graph, which are clustered
closer together in the second analysis. This
means that litter accounted for 61% of the previously
unexplained variation in the data. Note that
it would be impossible to determine whether litter
effects are present if only one litter per treatment
group was used because litter and treatment would
be completely confounded.


A similar analysis was performed for other variables
and the results are displayed in Table 1. It is
clear that litter-to-litter variation is important for 
a number of behavioural outcomes. It is also clear 
from Figure 3A how easy it is to get false positives 
with an inappropriate design and analysis. Imagine
if an experiment was conducted with only one VPA 
and one saline litter, with ten animals from each, 
and that there is no overall effect of VPA on a par- 
ticular outcome. If the experimenter happened to 
select Litter A (saline) and Litter M (VPA) there 
would be a significant increase due to VPA, but if 
Litter D (saline) and Litter G (VPA) were selected, 
there would be a significant effect in the opposite 
direction! There are many combinations of a single 
saline and VPA litter that would lead to a signifi- 
cant difference between conditions. Having two or 
three litters per group instead of one will reduce the 
false positive rate, but it will still be much higher 
than 0.05 [3]. In addition, these apparent differences
would not replicate with a properly designed exper- 
iment. 
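
The idea of asking how much of the residual variation is attributable to litter can be sketched crudely in Python. This is a simplification and not the mixed-effects analysis used for Table 1: it assumes the effects of the experimental factors have already been removed, and it simply compares the sum of squares of the residuals before and after subtracting each litter's mean. The function name and the residual values are hypothetical.

```python
from statistics import mean


def variance_explained_by_litter(residuals_by_litter):
    """Fraction of the residual sum of squares removed by centring each litter."""
    all_residuals = [r for litter in residuals_by_litter.values() for r in litter]
    grand_mean = mean(all_residuals)
    total_ss = sum((r - grand_mean) ** 2 for r in all_residuals)
    within_ss = 0.0
    for litter in residuals_by_litter.values():
        litter_mean = mean(litter)
        within_ss += sum((r - litter_mean) ** 2 for r in litter)
    return 1.0 - within_ss / total_ss


# Hypothetical residuals (observed minus predicted) for four litters.
residuals = {
    "A": [1.2, 0.8, 1.5],
    "D": [-0.9, -1.3, -0.6],
    "G": [0.1, -0.2, 0.4],
    "M": [-0.5, 0.3, -0.8],
}
print(f"Share of residual variation between litters: "
      f"{variance_explained_by_litter(residuals):.2f}")
```

A value near zero would suggest that litters do not differ once the experimental factors are removed, whereas a value near one would mean that almost all of the remaining variation lies between litters. Unlike the likelihood ratio test, this ratio gives no p-value and does not adjust for the number of litters.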



How power is affected by the number of litters 
and animals 

Figure 4 shows the power for various combinations 
of number of litters and number of animals per lit- 
ter. This analysis is based on averaging the values 
for the animals within a litter and then comparing 
the groups with a t-test. It is clear that increasing 
the number of animals per litter has little effect on
power (the lines in Figure 4A are nearly flat after
two animals per litter), whereas increasing the number
of litters results in a large increase in power. The
results for the mixed-effects model are nearly identical,
and the results of the inappropriate analysis
which ignores litter show increasing power with
increasing number of animals per litter (Additional
File 4). This is false power, however, and is due to an
artificially inflated sample size (pseudoreplication),
and will lead to many false positive results.

Some may object on ethical grounds to using so 
many litters and then taking only one or a few ani- 
mals from each, as there will be many additional an- 
imals that will not be used, and presumably culled. 
Certainly all of the animals could be used, but there 
is almost no increase in power after three animals per 
litter (at least for the locomotor data) and therefore 
it is a poor use of time and resources to include all 
of the animals. One could argue therefore that it is 
unethical to submit a greater number of animals to 
the experimental procedure if they contribute little 
or nothing to the result. One could also argue that 
it is even more unethical to use any animals for a 
flawed or severely underpowered study in the first 
place, and then to clutter the scientific literature 
with the results. One way to deal with this issue 
is to use the excess animals for other experiments. 
For example, a few animals per litter might be used 
for a novel behavioural task. Others may be used 
to test the effects of a therapeutic compound, and 
the rest for a study looking at gene expression. This
requires greater planning, organisation, and coor- 
dination, but it is possible. Another option is to 
purchase animals from a supplier and request that 
the animals come from different litters rather than 
have an in-house colony. 



How does litter-to-litter variation arise? 

Differences between litters could exist for a variety 
of reasons, including shared genes and shared prena- 
tal and early postnatal environments, but also due 
to age differences (it is difficult to control the time of 
mating), and because litters are convenient units to 
work with. For example, it is not unusual for littermates
to be housed in the same cage, which means
that animals within a litter share not just their
early, but also their adult environment. It is also
often administratively easier to apply experimental 
treatments on a per-cage (and thus per-litter) basis 
rather than per-animal basis. For example, animals 
in cage A and C are treated while cage B and D 
are controls. Animals may also undergo behavioural 
testing on a per-cage basis; for example, animals are 
taken from the housing room to the testing room one 
cage at a time, tested, and then returned. Larger 
experiments may need to be conducted over several 
days, and it is often convenient to do four cages 
on one day and four on the next, rather than take 
half the animals from all eight cages on each day. 
At the end of the experiment animals may also be 
killed on a per-cage basis. Given that it may take 
many hours to kill the animals, remove brains, collect
blood, etc., the values of many outcomes (e.g.
gene expression, hormone and metabolite concentrations,
physiological parameters, etc.) will change
due to circadian rhythms. All of these can lead to 
systematic differences between litters and can thus 
bias results and/or add noise to the data. 



There is an important distinction to be made 
between applying treatments to whole litters versus 
"natural" variation between litters. When a treat- 
ment is applied to a whole litter (e.g. VPA model of 
autism, maternal stress) then the litter is the exper- 
imental unit and thus the sample size is the number 
of litters. Therefore, by definition, litter needs to 
be included in the analysis if more than one animal 
per litter is used (or the values within a litter can 
be averaged). However, if multiple litters are used
but the treatment(s) are applied to the individual
animals, experiments should be designed so that if 
litter effects exist, then valid inferences can still be 
made. In other words, litters should not be con- 
founded with other experimental variables, because 
it would be difficult or impossible to detect their 
influence and remove their effects. Whether litter 
is an important factor for any particular outcome is 
then an empirical question, and if it is not impor- 
tant then it need not be included in the analysis. 
However, the power to detect differences between 
litters will be low if only a few litters are used in 
the experiment, and therefore a non-significant test 
for litter effects should not be interpreted as the ab- 
sence of such effects. What should not be done is to 
analyse the data with and without litter and choose 
the analysis that gives the "right" answer for the
experimental variable of interest [34]. Flood et al.
provide a nice example in the autism literature of
an appropriate design followed by an analysis both
with and without litter included [48]. They also
found a strong effect of litter on brain mass.



Four ways to improve basic and translational research

Better training for biologists 

Most experimental biologists are not provided with 
sufficient training in experimental design and data 
analysis to be able to plan, conduct, and interpret 
the results of scientific investigations at the level 
required to consistently obtain valid results. The so- 
lution is straightforward but requires major changes 
in the education and training of biologists and it 
will take many years to implement. Nevertheless, 
this should be a longer-term goal for the biomedical 
research community. 



Make better use of statistical expertise 

A second solution is to have statisticians play a
greater role in preclinical studies, including peer
reviewing grant applications and manuscripts, as
well as being part of scientific teams [49]. However,
there are not enough statisticians with the appropriate
subject matter knowledge to fully meet this
demand; just as it is difficult to do good science
without a knowledge of statistics, it is difficult to 
perform a good analysis without knowledge of the 
science. In addition, this type of "project support" 
is often viewed by academic statisticians as a sec- 
ondary activity. Despite this, there is still scope for 
improving the quality of studies by making better 
use of statistical expertise. 



More detailed reporting of experimental methods 

Detailed reporting of how experiments were conducted,
how data were analysed, how outliers were
handled, whether all animals that entered the study
completed it, and how the sample size was determined
are all required to assess whether the results
of the study are valid. A number of guidelines
have been proposed which cover these points, including
the National Institute of Neurological Disorders
and Stroke (NINDS) guidelines [50], the Gold Standard
Publication Checklist [51], and the ARRIVE
(Animals in Research: Reporting In Vivo Experiments)
guidelines [52]. For example, ARRIVE items
6 (Study design), 10 (Sample size), 11 (Allocating
animals to experimental groups), and 13 (Statistical
methods) should be a mandatory requirement
for all publications involving animals and could be 
included as a separate checklist that is submitted 
along with the manuscript, much like a conflict of 
interest or a transfer of copyright form. This would
make it easier for reviewers, editors, and other readers
to spot any design and analysis issues. In addition,
and more importantly, if scientists are required
to comment on how they randomised treatment al- 
location, or how they ensured that assessment of 
outcomes was blinded, then they will conduct their 
experiments accordingly if they plan on publishing 
in a journal that has these reporting requirements. 
Similarly, if researchers are required to state what 
the experimental unit is (e.g. litter, cage, individual 
animal, etc.), then they will be prompted to think 
hard about the issue and design better experiments, 
or seek advice. This recommendation will not only 
improve the quality of reporting, but it will also 
improve the quality of experiments, which is the 
real benefit. A final benefit is that it will make 
quantitative reviews/meta-analyses easier, because 
much of the key information will be on a single page. 
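
One way to make randomisation easy to report is to generate the allocation with a short script and record the seed. The sketch below is illustrative only: the dam identifiers, group labels, and seed are hypothetical, and the pregnant female is assumed to be the experimental unit, as in the VPA model.

```python
import random


def randomise_units(unit_ids, groups, seed):
    """Randomly allocate experimental units to groups of (near) equal size."""
    rng = random.Random(seed)  # fixed seed so the allocation can be reproduced
    shuffled = list(unit_ids)
    rng.shuffle(shuffled)
    # Deal the shuffled units out to the groups in turn.
    return {unit: groups[i % len(groups)] for i, unit in enumerate(shuffled)}


# Hypothetical identifiers for six pregnant females (the experimental units).
dams = ["dam01", "dam02", "dam03", "dam04", "dam05", "dam06"]
allocation = randomise_units(dams, ["VPA", "control"], seed=2013)

for dam in dams:
    print(dam, allocation[dam])
```

Reporting the seed and the unit that was randomised would let a reader confirm both the allocation and the experimental unit.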



Make raw data available 

Another solution is to make the provision of raw 
data a requirement for acceptance of a manuscript; 
not "to make it available if someone asks for it", 
which is the current requirement for many journals, 
but uploaded as supplementary material or hosted 
by a third party data repository. None of the VPA 
studies provided the data that the conclusions were 
based on, making reanalysis impossible. Remark- 
ably, of the thirty-five studies published, only one 
provided the necessary information to conduct a 
power analysis to plan a future study [45], and this
was only because one animal per litter was used and
the necessary values could be extracted from the
graphs. Datasets used in preclinical animal studies 
are typically small, do not have confidentiality is- 
sues associated with them, are unlikely to be used 
for further analyses by the original authors, and have 
no additional intellectual property issues associated 
with them given that the manuscript itself has been 
published. It is noteworthy that many journals re- 
quire microarray data to be uploaded to a publicly 
available repository (e.g. Gene Expression Omnibus 
or Array Express), but not the corresponding be- 
havioural or histological data. It is perhaps not 
surprising that there is a relationship between study
Publishing raw data can be taken as a signal that
researchers stand behind their data and therefore 
their conclusions. Funding bodies should encourage 
this by requiring that data arising from the grant 
are made publicly available (with penalties for
non-adherence).



The above suggestions would help ensure that
appropriate designs and analyses are used, and would
make it easy to verify claims or to reanalyse data.
Currently, it is often difficult to establish the for- 
mer and almost impossible to perform the latter. 
Moreover, it is clear that appropriate designs and 
analyses are often not used, making it difficult to 
give the benefit of the doubt to those studies with 
incomplete reporting of how experiments were con- 
ducted and data analysed. 



Conclusions 

While it is difficult to quantify the extent to which 
poor statistical practices hinder basic and transla- 
tional research, it is clear that a large inflation of 
false positive and false negative rates will only slow 
progress down. In addition, because of publication 
bias and researcher degrees of freedom, it is possible 
for a field to converge to the wrong answer. Experi- 
mental design and statistical issues are, in principle, 
fixable. Improving these will allow scientists to focus 
on creating and assessing the suitability of disease 
models and the efficacy of therapeutic interventions, 
which is challenging enough.

Figures

Figure 1 - Defining the experimental unit 

Pregnant females are the experimental units because they are randomised to the treatment (e.g. valproic
acid) or control conditions, and therefore n = 6 in this example. The three offspring within a litter will often
be more alike than offspring from different litters (within-litter variation is smaller than between-litter
variation), and multiple offspring within
a litter can be thought of as subsamples or "technical replicates", even though these are the scientific unit
of interest. Only the mean of the within-litter values is important when comparing treated and control
groups. Using all of the offspring without averaging will result in an inflated sample size (pseudoreplication) 
when using standard analyses. Instead of averaging, one could randomly select only one animal from each 
litter, or use a nested or hierarchical model to appropriately partition the different sources of variation. The 
only way to increase sample size, and thus power, is to increase the number of litters used. 

[Figure 1 diagram: pregnant females are randomised to treatment or control; averaging the animals within each litter gives 6 values to be used for analysis.]

Figure 2 - Analysis with and without litter taken into account

Nine pregnant female C57BL/6 mice were injected with 600 mg/kg VPA subcutaneously on E13, and five
control females received vehicle injections. Half of the animals in each condition were also injected with
either an mGluR5 receptor antagonist (MPEP) or saline. Total locomotor activity in the open field over a
30 min period at 8-9 weeks of age is shown. There is a slight increase in activity due to MPEP, but
this was not significant when differences between litters were ignored (Two-way ANOVA: mean difference =
0.60, F(1,44) = 3.17, p = 0.082). Adjusting for litter removed unexplained variation in the data, allowing
the small difference between groups to become statistically significant (Hierarchical model: mean difference
= 0.64, F(1,32) = 7.19, p = 0.011). Note how the values in the second graph have less variability around
the group means; this increased precision increases the power of the statistical tests. Lines go through the 
mean of each group, and points are jittered in the x direction.

[Figure 2 panels: "Litter ignored" and "Litter effects removed".]

Figure 3 - Visualising litter-to-litter variation 

The residuals represent the unexplained variation in the data after the effects of VPA and MPEP have been 
taken into account; they should be centred at zero and should not be associated with any other variable. 
However, the standard analysis (A) shows that when residuals are plotted against litter (x-axis) there are 
large differences between the different litters. In other words, there is another factor affecting the outcome 
besides the experimental factors of interest. The variance of the residuals (grey points on the right) is
high (σ² = 1.29). The proper analysis (B) reduces the unexplained variation in the data by 61% (σ² =
0.50; p < 0.001), which can be seen by the narrower spread of the grey points around zero, and the large
differences between the litters have been eliminated. This reduction in "noise" allows smaller true "signals"
to be detected. Error bars are SEM. Litters F and L only have one observation and thus no error bars. 
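
The 61% figure follows directly from the two residual variances quoted in this caption (1.29 and 0.50). The short Python sketch below reproduces the arithmetic; the function name is hypothetical and is used only for illustration.

```python
def variance_reduction(var_before, var_after):
    """Proportional reduction in residual variance after accounting for litter."""
    return (var_before - var_after) / var_before


# Residual variances reported in the caption of Figure 3.
reduction = variance_reduction(1.29, 0.50)
print(f"{reduction:.0%}")  # prints 61%
```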
Figure 4 - Power calculations for VPA experiments

Panel A shows how power changes as the number of animals per litter increases from one to eight (x-axis) 
and the number of litters per group increases from three to ten (different lines). It is clear that increasing the 
number of animals per litter has only a modest effect on power, with little improvement after two animals. 
A two-group study with three litters per group and eight animals per litter (2 × 3 × 8 = 48 animals) will
have only a 30% chance of detecting the effect, whereas a study with ten litters per group and one animal
per litter (2 × 10 × 1 = 20 animals) will have almost 80% power and use far fewer animals. Panel B shows
the same data, but presented differently. Power for different combinations of litters and animals per litter
is indicated by colour (red = low power, white = high) and reference lines for 70%, 80% and 90% power are
indicated. Note that these specific power values are only relevant for the locomotor activity task with a
fixed effect size, and will have to be recalculated for other outcomes. However, the general result (increasing
litters is better than increasing the number of animals per litter) will apply for all outcomes. 

[Figure 4 panels A and B; the x-axis of both panels is the number of animals per litter.]
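
The pattern in Figure 4 can be illustrated with a rough approximation, which is not the calculation used to produce the figure. Assuming the analysis is based on litter means, the variance of a litter mean is the between-litter variance plus the within-litter variance divided by the number of animals per litter, so adding animals only shrinks the second term. The sketch below uses a normal approximation to the power of a two-sided test at the 5% level. The effect size and variance components are hypothetical values chosen for illustration, and the normal approximation will overstate power when there are few litters.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(delta, var_between, var_within,
                 litters_per_group, animals_per_litter, alpha=0.05):
    """Normal-approximation power for comparing two groups of litter means."""
    # Variance of one litter mean: only the within-litter part shrinks
    # as more animals per litter are measured.
    var_litter_mean = var_between + var_within / animals_per_litter
    se = sqrt(2 * var_litter_mean / litters_per_group)
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    shift = delta / se
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


# Hypothetical values, chosen only for illustration.
DELTA, VAR_BETWEEN, VAR_WITHIN = 1.0, 0.8, 0.5

for litters, animals in [(3, 1), (3, 8), (10, 1), (10, 8)]:
    power = approx_power(DELTA, VAR_BETWEEN, VAR_WITHIN, litters, animals)
    print(f"{litters} litters/group x {animals} animals/litter: "
          f"approximate power {power:.2f}")
```

Under these assumed values, ten litters with one animal each give higher approximate power than three litters with eight animals each, despite using fewer animals in total.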
Tables

Table 1 - Importance of litter effects on body weight and behavioural tests. 

The p-value tests whether the litter-to-litter variation was significantly greater than zero. 



Variable                Reduction in σ²    P-value
Locomotor activity      61%                <0.001
Body weight             50%                0.003
Marbles buried          38%                0.045
Anxiety (open field)    35%                0.0504
Grooming                23%                0.116

σ² is the residual (unexplained) variation.



Additional Files 

Additional file 1 — R code for the analyses and power calculations 

Code for the analyses and power calculations is given as a plain text file.



Additional file 2 — Raw data 

Raw data from Mehta et al., including body weight, locomotor activity and anxiety measures from the
open field test, grooming behaviour, and number of marbles buried in the marble-burying test. Details can 
be found in the original publication. 



Additional file 3 — List of VPA studies 

List of the thirty-four studies using the VPA rodent model of autism. 



Additional file 4 — Power analysis for the mixed-effects model and the incorrect analysis 

The interpretation of the graphs is the same as for Figure 4 (main text). Panels A and B are for the mixed-effects
model and are nearly identical to the results for averaging the values within each litter and then using a 
t-test (Figure 4 main text). Panels C and D ignore litter and just compare all of the data with a t-test, 
which results in an artificially inflated sample size and inappropriately high power. 
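
The consequence of ignoring litter can also be illustrated by simulation. The sketch below is not the code in Additional file 1; it is a small Python simulation under assumed values (three litters per group, eight animals per litter, equal between-litter and within-litter standard deviations, and no true treatment effect). It compares how often a t-test on all animals and a t-test on litter means exceed their two-sided 5% critical values, which are hardcoded standard t-table values for 46 and 4 degrees of freedom.

```python
import random
from math import sqrt
from statistics import mean, variance

# Assumed design and variance components (hypothetical, for illustration).
N_LITTERS, N_ANIMALS = 3, 8
SD_LITTER, SD_ANIMAL = 1.0, 1.0
# Two-sided 5% t critical values: df = 46 (48 animals) and df = 4 (6 litter means).
CRIT_ALL_ANIMALS, CRIT_LITTER_MEANS = 2.013, 2.776


def t_statistic(a, b):
    """Pooled-variance two-sample t statistic."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled * (1 / na + 1 / nb))


def simulate_group(rng):
    """One group with no treatment effect: shared litter effect plus animal noise."""
    group = []
    for _ in range(N_LITTERS):
        litter_effect = rng.gauss(0, SD_LITTER)
        group.append([litter_effect + rng.gauss(0, SD_ANIMAL)
                      for _ in range(N_ANIMALS)])
    return group


def false_positive_rates(n_sims=2000, seed=1):
    """Proportion of null simulations declared significant by each analysis."""
    rng = random.Random(seed)
    naive_hits = means_hits = 0
    for _ in range(n_sims):
        g1, g2 = simulate_group(rng), simulate_group(rng)
        # Incorrect analysis: every animal treated as an independent observation.
        all1 = [x for litter in g1 for x in litter]
        all2 = [x for litter in g2 for x in litter]
        if abs(t_statistic(all1, all2)) > CRIT_ALL_ANIMALS:
            naive_hits += 1
        # Appropriate analysis: one value (the mean) per litter.
        means1 = [mean(litter) for litter in g1]
        means2 = [mean(litter) for litter in g2]
        if abs(t_statistic(means1, means2)) > CRIT_LITTER_MEANS:
            means_hits += 1
    return naive_hits / n_sims, means_hits / n_sims


naive_rate, means_rate = false_positive_rates()
print(f"false positive rate ignoring litter: {naive_rate:.3f}")
print(f"false positive rate using litter means: {means_rate:.3f}")
```

With litter effects of this assumed size, the test that ignores litter would be expected to reject a true null hypothesis far more often than the nominal 5%, whereas the test on litter means should stay close to it.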